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An efficient computational scheme devised for investigations of ground state properties of the 
electronically correlated systems is presented. As an example, (H 2 ) n chain is considered with the 
long-range electron-electron interactions taken into account. The implemented procedure covers: 

(i) single-particle Wannier wave-function basis construction in the correlated state, (it) microscopic 
parameters calculation, and (m) ground state energy optimization. The optimization loop is based 
on highly effective process-pool solution - specific root-workers approach. The hierarchical, two-level 
parallelism was applied: both shared (by use of Open Multi-Processing) and distributed (by use of 
Message Passing Interface) memory models were utilized. We discuss in detail the feature that such 
approach results in a substantial increase of the calculation speed reaching factor of 300 for the 
fully parallelized solution. The elaborated in detail scheme reflects the situation in which the most 
demanding task is the single-particle basis optimization. 
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I. PHYSICAL MOTIVATION: EXACT 

DIAGONALIZATION + AB INITIO METHOD 

Electronically correlated systems are important both 
from the point of view of their unique physical properties 
and from nontrivial computational methods developed 
to determine them. The latter cover methods based on 
the Density Functional Theory (DFT) with the energy 
functional enriched by the correlation terms - the on-site 
repulsion U in the Hubbard model [T] and the Hund’s 
rule term in the case of orbital degeneracy. Often, they 
are incorporated into either DFT or the Dynamic Mean 
Field Theory (DMFT) approach supplemented with the 
LDA-type calculations (see e.g. pjj]). On the other hand, 
the Configuration-Interaction (Cl) method does not suf¬ 
fer from the well-known double counting problem\ 2 [2], 
inherent in the DFT+U or LDA+DMFT methods. An¬ 
other approach, similar in its spirit to the Cl method, for¬ 
mulated as a combination of the first- and second quanti¬ 
zation (FQ, SQ respectively) formalisms was elaborated 
in our group in the last decade and termed the Exact 
Diagonalization Ab Intito (EDABI) approach [3] ;3J. 
This method allows for a natural incorporation of the 
correlation effects consistently by the advantages of us¬ 
ing the SQ language so that the double-counting problem 
does not arise at all. Also, by construction, it includes 
the Pauli principle for the fermionic systems. In con¬ 
trast to Cl the EDABI approach avoids any direct dealing 
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with the many-body wave function expressed via a linear 
combination of the Slater determinants [3]. Instead, it is 
based on the many-particle quantum states constructed 
in the occupation number representation [5] - standard 
procedure for the SQ formulated problems. 

The application of EDABI was found promising in 
view of research devoted to the hydrogen molecular sys¬ 
tems with inclusion of interelectronic correlations [B], 
nano-clusters |7, and to atomic hydrogen metallization 
|S]. As the many-particle state is explicitly written in 
the occupation-number representation (Fock space), the 
starting Hamiltonian is formulated in the SQ language. 
Electronic correlations are then automatically included 
in the modeled system. However, in the Hubbard-like 
starting Hamiltonians the knowledge of micro¬ 

scopic parameters, such as the on-site energy, the inter¬ 
site hoppings, and the Coulomb repulsion magnitudes are 
regarded as input information. These parameters are of¬ 
ten estimated indirectly. With this limitation, specific 
phase-diagrams are constructed and the phase bound¬ 
aries of interest are determined as a function of those mi¬ 
croscopic parameters which are not directly measurable 
(cf. e.g. [ 1 3IIT5] I. In EDABI we take a different route: 
relatively simple and small systems are to be described 
consistently in the sense that the microscopic parameters 
are obtained explicitly as an output of an appropriate 
ab-initio variational procedure. Therefore, the EDABI 
approach should be regarded as an ab-initio method but 
with the single-particle wave functions being determined 
self-consistently in the correlated state. In this manner, 
the problem solution is reversed with respect to that in 
either LDA+U or LDA+DMFT. Namely, we first formu¬ 
late the Hamiltonian and diagonalize it in SQ formalism 
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and determine the single-particle wave function only as 
a second step. However realistic, such an approach to 
the electronically correlated systems implies a substan¬ 
tially greater computational complexity, since the varia¬ 
tional optimization consists of (?) microscopic parameters 
calculation and (ii) concomitant Hamiltonian-matrix di- 
agonalization. We address here the issue (i) presenting 
how the modern High Performance Computing (HPC) 
cluster architecture can be utilized in the context, where 
the number of microscopic parameters is substantial if 
not large, and their calculation is one of the potential 
bottlenecks in the whole computational procedure. We 
also provide an example how a many-body problem at 
hand may be supplemented with the two-level parallelism 
in an intuitive manner. We do not discuss either the 
methodology related to the point (ii) or to its algorith¬ 
mic aspect or else, to technical opportunities provided by 
e.g. recent fast development of Graphic Processing Units 
(GPU) computational techniques that are also in the area 
of interest mm- Nonetheless, as it becomes clear be¬ 
low, heterogeneous solutions are also easily applicable in 
our scheme. 

The structure of the paper is as follows. In Sec. II we 
describe briefly the EDABI method (cf. Appendix A for 
details) and emphasize the computational complexity as¬ 
pects. Next, in Sec. |III[ we show how the process-pool 
concept enhanced by the two-level parallelism forms a 
natural solution. In Sec. |IV| we present the outcome of 
its implementation: results of calculations carried out for 
( H 2 ) n exemplary system and discuss the achieved speed¬ 
up when compared to the reference single CPU computa¬ 
tions. In Appendix [B] we discuss the convergence of our 
model for the case of infinite systems. 


II. COMPUTATIONAL METHOD - EDABI 


As stated in the foregoing Section, the computational 
method considered here is based on the EDABI method, 
comprehensive description of which can be found e.g. in 
[3] . It allows for consideration of realistic, electronically- 
correlated nanosystems within the framework of the com¬ 
bined first- and second-quantization formalisms. Below 
we sketch this method (for details see Appendix [A} . 


A. Second-quantization aspect 

For the purpose of calculating the ground-state energy 
of given system we start with the second-quantization 
language [5j 1X81120] . We introduce the fermonic anihila- 
tor(creator) algebra by imposing the anticommuta¬ 

tion relations among them, namely 

{C> c] a ,} = {c ia , c ja ,} = 0 and {cJ CT , c ja ,} = 5 l3 8 aa ', 

(1) 


where i and j denote sites (nodes) of a fixed lattice, 
<r, a' = ±1 are the spin quantum number, and the anti¬ 
commutator is { A , B} = AB + BA. 

We represent the many-particle basis states {| ( E > A;)} on 
the lattice of A sites in the Fock space uni in the following 
manner 

I®** = n 4 n 41 °) > ( 2 ) 

where and f2^ are the subsets of sites occupied by 
fermions with A| and A^ particles respectively, and |0) 
is the vacuum state (with no particles), with (0| 0) = 1. 
Explicitly, 

|$) = J0,1, ••.,!) (8 jl,0, ..•,!) = (3) 

spin t spin 4- 

=41 • • • 1°) ■ 

With this concrete (occupation-number) representation 
of an abstract Fock space we define next the microscopic 
Hamiltonian of our interacting system of fermions. 


B. Definition of the physical problem 

We take the real-space representation with the starting 
field operators in the form 


’M r ) = Yl Wi ^ x<r£ i^ ( 4 ) 

i 

where Wi( r) is the single-particle wave function for 
fermion (e.g. electron) located on i-th site, x<j is the s pi n 
wave function (cr = ±1) with global spin quantization 
axis (z-axis). In general, the many-particle Hamiltonian 
is defined in the form 

li=Y^ [ <i 3 r^( r )Hi(r)^(r) (5) 

a J 

+ d 3 rd 3 r'i>l(r)i’l,(r')V(r - r')’IV(r , )’M r ), 

crcr' 

where Hi is the (spin-independent) Hamiltonian for a 
single particle in the milieu of all other particles and 
U(r — r') is the interaction energy for a single pair. For 
the modeling purposes we assume that Hi is expressed 
in the atomic units (h = e 2 /2 = 2 m e = 1, where e is the 
charge of an electron and m e is its mass) and expresses 
the particle kinetic energy and the attractive interaction 
with the protons located at {RJ, he., 

Ns 9 

H,(r)“-V“-X: ]srr7 |, (6) 

2=1 1 1 1 

where Ns is the number of sites, whereas 



( 7 ) 
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represents the Coulomb repulsive interaction between 
them. Substituting Q into ([5]) we obtain the explicit 
second-quantized form of the Hamiltonian |5] [2D] he., 

^ = E E ''j'W'j" + E E (8) 

ij cr ijkl a,a' 


where t l;j and Vjj^i are integrals associated with the one- 
and two-body operators respectively 


tij — 


Vijkl — 


Wi{ r) Til Wj (r 
^ d 3 r w*(r)7ii(r)wj(r), 
Wi(r)wj(r') V w k (r)wi(r') 

- 73 „ j 3 J 


(9a) 


(9b) 


d 3 rd 3 r' w*(r)w*(r')V(r — r')wk(r)wi(r'). 


The first term contains the single-particle part composed 
of the atomic energy e* = tu, as well as expresses the 
kinetic (hopping) part with tij (i ^ j) being the so- 
called hopping integral. The second expression contains 
intraatomic (intrasite) part of the interaction between the 
particles ([/* = Van - the so-called Hubbard interaction), 
and the intersite (interatomic) interaction (Kij = Vijij - 
the last important term for the purposes here). A remark 
is in place here: when the single-particle basis {w,;} is as¬ 
sumed as real, the last two two-body interaction terms 
are the exchange-correlation energy (Jij = Vijji) and the 
so-called correlated hopping (V t j = Vijjj). 

The Hamiltonian 0 , with inclusion of all the two- 
site terms only, can be rewritten in the following form, 
with the microscopic parameters contained in an explicit 
manner, i.e., 


^ = E + ?> E MioA’o- + 7, E ( 10 ) 

i,cr j i,cr 

■ E - y 'A- s ; + \ E ( K v J >) Mi 

+ E J V d lAl £ jl d jt + E Vi + £ jAa) • 


The basis {wi(r)} Q in 0 needs to be orthonormal, i.e. 
orthogonal and normalized to unity. In the next Section, 
we describe how to construct the basis satisfying this con¬ 
dition. In summary, by solving (diagonalizing) T~L we un¬ 
derstand finding the optimal many-particle configuration 
with a simultaneous single-particle basis {wi(r)| deter¬ 
mination. Typically EMaim the parameters e* = e, 
tij, U, are regarded as extra parameters. Here we 
calculate them explicitly along with the diagonalization 
in the Fock space at the same time. 


C. Basis orthogonalization as a bilinear problem 

We require the orthonormality of the set of the single¬ 
particle wave functions {u>i(r, a)}, i.e., set the conditions 

(iOi(r)| Wj(r)) = (11) 

/ d 3 r w(r— ~Rj)w(r — Hj) = Sij, 

JW 3 

where 5ij is the Kronecker delta. Note that a will play a 
role of variational parameter specifying the way of con¬ 
structing the basis (cf. Sec. E33 Namely, the single¬ 
particle wave functions (Wannier functions) are approx¬ 
imated by a finite linear combination in a selected set. 
These wave functions describe the single-electron states 
centered on every atomic/ionic site, i.e., at positions 
{Ri}. Such approach is related to the tight-binding ap¬ 
proximation (TBA |22|1, where the atomic orbitals com¬ 
posing Wi are represented by e.g. the Slater-type orbitals 
(STO). For the purpose of the present model analysis, 
only the Is Slater functions are taken into account, i.e., 

ify(r) = J— e -“l r - R 'l, (12) 

V 7r 

where a is the inverse wave-function size. Similarly to 
|I23] , for each position i we construct linear combination 

L 

Wl (r) = ^^U)(r), ( 13 ) 

3=0 

where {/fy } compose a set of (L + 1) mixing coefficients, 
and 7 r,; : {0,. .., L} —► TV), is the function mapping in¬ 
dexes to the neighborhood A/) of the site (node) i located 
at Ri. 

Note that in general Vhnij) may be a sum over Slater 
functions in the neighborhood, which varies the number 
of /3 coefficients and the number of nodes in the neigh¬ 
borhood Mi. This circumstance does not influence the 
discussion, but is of crucial importance when the scheme 
is implemented numerically. Also, the new basis {wi(r)} 
is orthogonal in the neighborhood Mi- 

We are looking for the set of {/fy}, orthonormaliz- 
ing the basis {wi(r)} for given geometry (effectively de¬ 
scribed by set of ionic coordinates Rj) and for the arbi¬ 
trary inverse wave-function size a. In order to achieve 
this we replace the original problem Vj £ {ni(k)\k £ 
{ 0 , 1 ,-..,£}} 



(14) 


with the equivalent set of bilinear equations 



( 15 ) 
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where 

/ Pm(o) \ 

/^7 Ti ( 1 ) 

i t = ; , ( 16 ) 

V ) 

and the overlap integrals are 

= ( d3r &,(«) ( r ) ^(rn) (r), (17) 

JR3 


the multidimensional optimization space. The EDABI 
method is based on the variational principle within which 
the single-particle wave functions at given atomic con¬ 
figuration {R.i} are optimized to determine the ground- 
state energy of the correlated system. From the computa¬ 
tional point of view, four main tasks are to be performed 
in a single iteration, i.e., for a given trial value of a: 

1. Single particle basis ortonormalization - solution of 
L + 1 dimensional bilinear set of equations. 


with = 1. For given geometry and the inverse 

wave-function size a we solve the system (151 numeri¬ 
cally. 

The computation of the two-body integrals (cf. 
Eq. (9b ) ) must be performed in a general case numeri¬ 
cally (see e.g. [7j and citations therein). Therefore, STO 
are usually approximated by their expansion in the so- 
called Gaussian basis, namely 


i’i (r) 


a 


3/2 


E(^f< 


-a 2 r^|Ri-a| 2 


(18) 


with Nq being the parameter describing number of Gaus¬ 
sian functions taken into account and the adjustable set 
{r a } is obtained through a separate procedure [7]. Exact 
or approximate ground state properties (i.e. ground state 
energy, structural properties, electronic density, etc.) are 
obtained when the eigenstate corresponding to the lowest 
many-particle eigenvalue is determined with the diago- 
nalization performed in the Fock space. In the context 
of EDABI, exact methods were successfully applied, e.g., 
the Lanczos mi algorithm for the matrix diagonalization, 
executable for nanosystems mm- 

Although originally EDABI was formulated for finite- 
size systems, the scheme can be regarded as a general 
variational procedure. According to its generic charac¬ 
ter, it is applicable in combination with another corre¬ 
lation oriented approach, dedicated to the approximate 
Hamiltonian diagonalization. As an example, the bulk 
systems with proper translational symmetries were ana¬ 
lyzed [HJ EH, based on the modified Gutzwiller Approx¬ 
imation (SGA). One may incorporate other diagonaliza¬ 
tion schemes applicable to the EDABI method. Here we 
consider only the scenario, according to which diagonal¬ 
ization in the Fock space is performed exactly - i.e., the 
Hamiltonian matrix is generated with the help of basis 
([2]) and diagonalized in terms of iterative (e.g. Lanczos) 
numerical algorithm. 


D. Optimization procedure and computational 
complexity 

The set {^(r, a)} describes the system in question: 
the interaction parameters {Vijki} and intersite hoppings 
{Uj}. On the other hand, there are independent pa¬ 
rameters < a, {Ri} (, where {R-i} together with a form 


2. Computation of the one-body microscopic param¬ 
eters - scaling as 0(L 2 NqNs) . 

3. Computation of the two-body microscopic param¬ 
eters - scaling as O^L^Nq). 

4. Hamiltonian diagonalization - dependent on the se¬ 
lected approach (exact, mean-field, Gutzwiller Ap¬ 
proximation, etc.). 


The tasks corresponding to [2] and [3] are central to the 
subsequent considerations. While for relatively simple 
models, such as one band Hubbard model, there are 
only three integrals to compute (the nearest-neighbor 
hopping, the atomic reference site energy, and the on¬ 
site electrostatic repulsion), this is not the case in the 
situation, in which a more complicated Hamiltonian de¬ 
scribes our system. The extended Hubbard model (see 
Sec. IVI, where the non-local electron-electron interac¬ 
tions are taken up to some cut-off distance, is associated 
with the increasing number of the two-body integrals to 
be computed. However, also for the multiband Hubbard 
model case, the number of hopping integrals increases as 
0(N 2 ), where N b is the number of bands. Therefore, an 
effective scheme allowing to obtain - possibly quite large 
- set of microscopic parameters in a run-time, is desired. 
In the following Section we propose an explicit solution 
of this last issue. 


III. PROCESS-POOL SOLUTION AND 
TWO-LEVEL PARALLELISM 

As we said above, the standard task is to diagonalize 
Hamiltonian © defined in the Fock space (occupation- 
number representation). This means, to determine the 
ground-state energy for given values of the microscopic 
parameters: e*, iy, U t , and Kjj (in general, Jij and V tJ 
as well). The principal work we would like to under¬ 
take here is to determine the renormalized wave functions 
{w.j(r)}.-_■! N in the resultant (correlated) ground state. 
The first aspect of the whole problem presents itself as 
an equally important part, as only then the ground state 
configuration of our system can be defined physically, i.e., 
as a (periodic) system with known lattice parameter (in¬ 
terionic distance). In the remaining part both aspects of 
the optimization problem are elaborated together with 
concomitant technical details provided. 
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A. Optimization loop 


We focus our analysis according to the scenario that 
the computational time spent on the diagonalization in 
the Fock space is negligible when compared to the calcu¬ 
lation of the two-body integrals appearing in the calcu¬ 
lation of microscopic parameters. For the sake of clarity, 
let us rewrite Hamiltonian (|8]) in more compact form 


m ij 


—rn:ij Om\ij - 


(19) 


where S m;ij - G {ei,tij,Ui,Kij,Vij,Jij} and sym¬ 

bolizes the operator part, e.g., Ot-ij = cl a Cj a . One 
should note that the parameter set B m;lJ must be cal¬ 
culated in each iteration step during the optimization 
procedure. Computation of the microscopic parameters 
can be performed independently which in turn provides 
an opportunity for its acceleration by means of the par¬ 
allelism application. 

Let us consider some generic optimization procedure 
OP({Ri}) returning the minimal energy at given accu¬ 
racy, as a function of structural parameters. In OP the 
system energy is sampled as a function of a so we denote 
it as Eg{oc). Taking into account that X/ a EG(ot) is not 
obtainable in general case, the optimization scheme en¬ 
coded in OP relates to a non-gradient method, e.g., the 
golden-search for the one-dimensional case. The com¬ 
putation of Eg{u) (sampled by OP) consists of calcula¬ 
tion of combined with the Hamiltonian matrix 

diagonalization. Within our approach, the computation 
speed-up is achieved by implementing the process-pool 
or the root-worker processes. This solution might be re¬ 
garded as a thread-pool pattern, but constructed within 
the framework of the distributed memory model. Work¬ 
ing threads are replaced by the processes - let us call 
them workers - which may communicate by the utiliz¬ 
ing the Message Passing Interface (MPI) - as it is done 
in our implementation. Workers remain in the infinite 
loop, monitoring signal from the root process which in 
turn is responsible for the job triggering and synchro¬ 
nization. It can also participate in the calculations - in 
our case it performs matrix diagonalization. Depending 
on what kind of signal is sent (in certain protocol es¬ 
tablished for the communication purposes between work¬ 
ers and root) workers: (i) wait, ( ii) start to compute, 
(in) break and exit from the loop. Since (ii) might be 
considered as a generic task, one can see that proposed 
approach is extendable to include diagonalization - e.g., 
if one deals with the big block-diagonal matrices, each 
block could be diagonalized independently by the worker 
processes. Process-pool is supposed to be efficient as¬ 


suming that the following principal condition is fulfilled: 
the task performed by each of the worker processes (or 
at least by most of them) is computationally the most 
expensive part, particularly if it shadows the communi¬ 
cation latency. 

Each process in the process-pool may utilize - if there 
are available resources - shared memory model. Thus 
the solution benefits from two-level parallelism, where 
worker-process parent thread forks into working threads, 
allowing in turn to perform each task faster (integral 
computation in this context). In our case the elements 
from the set {S m ;ij} are distributed into the sub-sets 
I p = } p where p denotes the processor id. The 

I p relate to task stack assigned to p-th process. The 
sub-sets construction should be performed carefully to 
keep well-balanced workload on processor, e.g. providing 
equal distribution of the integrals calculation among I p . 
The process-pool consists of P + 1 processes, P of them 
are workers and one - as mentioned - is a root. As follows 
from the scheme (cf. Fig. [l]), process-pool is applied to 
the EDABI optimization loop. The OP procedure per¬ 
forms a space sampling along non-gradient optimization 
scheme. Each trial -Eg computation demands parame¬ 
ters update and integrals calculations. The latter one 
exploits parallelism at two levels. Each of P processes 
computes integrals grouped in its assigned I p subset and 
each of the two-body integral is calculated in the nested 
(fourth times) loops which are collapsed in terms of the 
utilization of the openMP framework. The whole com¬ 
putation originates from the need to determine the two- 
body integrals (9b) which are expressed as 


Vijkl 


(wiWj\V\w k wi) = (20) 

^ ' PpPqPrPs (V’tt; (p) VVj (?) I ^ I VV*, (r) VVi (s)) = 

pqrs 


E 

pqrs 


\rPqrs 
v ijkl ’ 


where if represents the Gaussian contraction. Ele¬ 
ments v^iy are computable independently; therefore 
Vijhi is obtained with the help of openMP reduction 
clause. Computed integrals are gathered in a single ar¬ 
ray in terms of MPI_Gather function or optionaly of 
MPI_ Allgather if necessary (which potentially may in¬ 
crease communication latency). Then, Hamiltonian ma¬ 
trix is updated with the proper values and the diago¬ 
nalization step starts. As an output of diagonalization, 
the trial Eq is computed and processed in OP. Our 
implementation bases on MPI and openMP, though its 
scheme is generic and might be implemented by means of 
any of known technologies or self-made implementations 
as well. 
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FIG. 1. Process-pool solution applied for the optimization procedure (OP) in the context EDABI method. 
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FIG. 2. (H 2 )n molecular chain, its parametrization with 

possible hoppings from/to central (blue) molecule. Green 
molecules are included as the background field related to the 
system by the periodic boundary conditions. 


IV. EXEMPLARY SOLUTION: MODEL OF ( H 2 ) n 
CHAIN 

We test our solution by means of analysis performed 
for the chain consisting of n = 3 hydrogen molecules 
stacked at intermolecular distance a , with the molecule 
bond-length R, and the tilt angle 6. We regard this 
configuration as a part of periodic system (cf. Fig. |2j. 
Hydrogen molecular chains are interesting in view of the 
crucial role of electronic correlations in the molecules and 
related low-dimension systems GBE1E]. The stability of 
the hydrogen molecular system was studied by means 
of variety of methods, e.g. DFT (25, or Self-Consistent 
Field (SCF) [26, 27], also in the context of the existence 
of superfluidity |28| . 

We discuss the molecular hydrogen chain within the 
so-called extended Hubbard model. This means that the 
interactions associated with the different atomic centers 
( Kij ) are taken into account. Eventually, the electronic 



FIG. 3. Ground state energy (per atom) for ( 1 / 2)3 molecular 
chain as a function of the intermolecular distance a. Note 
the convergence to analytical solution [6] for the separate free 
molecule limit when a —> 00 . 

part of Hamiltonian, with the ion-ion interaction explic¬ 
itly included, can be rewritten then as composed of the 
three parts: 

R-Hubb = Y e ^i + Y ^ C L c ja + U Y ( 21a ) 

i ijcr i 

'^ext — ^ Hubb “t“ ^ ^ ^ , (21b) 

ij 

'Htot = 'Hext + 0 " -p | = 'Hext + (21c) 

^ ij I — I 
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FIG. 4. Inverse atomic orbital size a versus the intermolecular 
distance a. Note the convergence to the analytical solution 
[6] for the separate free molecule limit when a —> oo. 


where the first two terms in ( |21a[ ) represent the single¬ 
particle energy with all possible hoppings tij (to the 
fourth nearest neighbor, see Fig. [2| and are calculated 
with respect to the background field (see Sec. |IV A for 
details), U is on-site Coulomb repulsion (Hubbard term), 
Kij are intersite Coulomb repulsive interactions between 


supercell and the background sites (see j7] and Ap¬ 
pendix [B] for details), is the proton-proton repul¬ 
sive interaction. Despite its relative simplicity, the sys¬ 
tem exhibits non-trivial properties. However, as we fo¬ 
cus mainly on the computational aspects, we present here 
only the basic physical properties. Computational per¬ 
formance tests of our solution are undertaken for arbi¬ 
trarily chosen molecular bond-length R = 1.43042(ao), 
corresponding to the equilibrium value obtained by us 
previously for a single H 2 molecule j6]; also 9 = ir/2. The 
total number of electrons is equal to the number of atomic 
centres (6 for n = 3). We test Eq against the varying 
intermolecular distance a (as shown in Fig. [3| obtaining 
the van der Waals-like behavior of the total energy, as 
expected dm m i2Z]. The single (spatially separated) 
H 2 -molecule ground-state energy is reproduced asymp¬ 
totically for a —> 00 , as marked in Fig. [3] For the sake 
of completeness, we present in Fig. [I] the inverse atomic 
orbital size a. In Fig. [5] we plot the contours of the elec¬ 
tronic density n(r) as the cross-section on the X — Y 
plane close to the configuration related to the minimal 
value of Eq. A very important feature of this solution 
worth mentioning is that as a diminishes and approaches 
R we observe a discontinuous phase transition from the 
molecular to the atomic states, but this feature of the 
results are discussed elsewhere [29]. Also, the minimum 
energy provides a stable configuration against the disso¬ 
ciation into separate molecules (cf. Fig. [ 3 ]). 


R- 1 -43042 (a 0 ) a- 4.1 (a 0 ) 0 = 31/2 

0.5 
0.4 
0.3 

1— 

0.2 c 
0.1 
0 

-6 -4 -2 0 2 4 6 



r x (a 0 ) 


FIG. 5. Electronic density n(r) projected on X — Y plane 
of the aligned H 2 molecules for the intermolecular distance 
close to the equilibrium value (marked by the vertical dashed 
line in Fig. [3]). The values of the structure parameters are 
specified. 


A. Convergence study 


We have performed the convergence study to obtain 
parameters suitable for the speed-up analysis of the im¬ 
plemented approach. There are two most important com¬ 


ponents playing crucial role: number of Gaussian func¬ 
tions Nq taken in contraction (18) and the size of peri¬ 
odic background-field of the super-cell. The former does 
not require additional comment; it is not the case for 
the latter. With the periodic boundary conditions being 




























imposed, a proper cut-off distance for the interactions 
must be also established. The atomic energy eo becomes 
in general lower when a larger number of ionic centers are 
taken into account (i.e., the larger cut-off distance), then 
the contribution from the electron-ion becomes stronger. 
On the other hand, analogical in nature but opposite in 
sign effects originate from the Coulomb ion-ion and the 
electron-electron repulsions. In the limit r cu t 0 ff —f oo 
both contributions cancel each other, as discussed in the 
context of EDABI in m and, for the sake of complete¬ 
ness, in Appendix [B] One may note that the just de¬ 
scribed behavior is similar to the cancellation effect ob¬ 
served in the jellium model ED- We define M as cut-off 
parameter, describing the size of background field as: 

M = rcutoff (22) 

a 

In Fig. § we plot the system total energy as a function 
of intermolecular distance, close to the energy-minimum 
configuration. It is clear that if smaller cut-off distance 
is selected, the energy is underestimated. If cut-off was 
chosen to be > 150a, the consecutive energies differ by 
less than the assumed numerical error in Lanczos ma¬ 
trix diagonalization procedure (i.e. 10 ~ i Ry ~ IrrieV). 
Therefore, further calculations were carried out for M = 
250, what results in 510 integrals to be calculated after 
reductions caused by the system symmetries. As it fol¬ 
lows from Fig. [7] Nq = 9 is the number of Gaussians, 
when the energy becomes convergent, therefore the sub¬ 
sequent analysis corresponds to Nq = 9 and M = 250. 


B. Strong scaling for MPI+openMP solution 


Taking into account that we have one electron per 
atomic center in a two band system (molecule consists of 
two hydrogen atoms), construction of the basis for three 
molecules leads to 924 basis states. The diagonalization 
in terms of the iterative algorithm - in this case Lanczos 
- is not a challenge, especially if only the lowest eigen¬ 
value is desired. A remark is necessary here: system de¬ 
scribed by substantially larger Hamiltonian matrices can 
also be treated efficiently in the framework of the elab¬ 
orated scheme. However, more sophisticated approaches 
(cf. Sec. Ill AI or approximate methods are then indis¬ 
pensable. Therefore, the example we provide, fulfills the 
requirement concerning the ratio of diagonalization to in¬ 
tegration time. The latter is supposed to be the bottle¬ 
neck during the execution of the optimization procedure. 
We investigate the speed-up (SU) defined as 


SU(P,X,Y) = 


T y (P= 1) 


T X {P) ’ 

with T denotes time spent on the computation and 

x,Ye{i,n,0}, 


(23) 


(24) 

where first two symbols correspond to parallelism based 
on the application of openMP and openMP+MPI respec¬ 
tively and 0 is associated with sequential solution. The 
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FIG. 6. Convergence study: a) ground-state energy Eg ver¬ 
sus intermolecular parameter a close to the minimum for the 
different background field size M\ b) limit of Eg for M —> oo; 
c) limit of the optimal intermolecular parameter as (mini¬ 
mizing Eg) for M —> oo. The plots in b) and c) represent the 
finite-size scaling analysis. 
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FIG. 7. Convergence study: Ground-state energy Eg versus 
number of Gaussians Ng taken to represent the atomic basis 
and for the background field size M = 250 and the number of 
SMP nodes P = 56. 


calculations were performed for the energy optimization 
at given ion configuration, close to the energy minimum, 
allowing to collect CPU time consumed by OP. The 
measurement were covered on HP server consisting of 
96 computational nodes each supplied with the two 8- 
cored Intel Xeon e5-2670 2.60GHz processors. The main 
board supported SMP exposing one logical 16-core pro¬ 
cessor per node thus P refers to the number of SMP 
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nodes. Physically, the communication among nodes was 
provided by the InfiniBand 4X QDR interface. The 
SU(P, II, I) (Fig.§ exhibits Amdahl law-like behaviour 
. This law states that speed-up limit in the strong¬ 
scaling regime can be described in terms of the following 
formula 

SU (P) = _ 1 L , (25) 

4 J I p 

where / is the part of the program susceptible for the 
parallelization. The value of / was found by means of 
fit the Amdahl’s law to the obtained data. Approximate 
value of / comes directly from the fit (see Fig. [8] for de¬ 
tails) and maximal SU can also be estimated by 

SU max « lim SU(P). (26) 

P—yoo 

We found / « 0.97 and SU max ~ 33, which confirms 
suitability of the application of our scheme. However, 
for the sake of completeness, we investigated the Gaus¬ 
sian number threshold associated with the process-pool 
solution efficiency. 


C. Number of Gaussians and Efficiency Threshold 


As mentioned above, the efficiency of the process-pool 
solution depends on the workload assigned to each of the 
process in the pool. Notably, Nq and the total number 
of integrals (associated with M (221) are the most im¬ 


portant factors, influencing robustness of the proposed 
approach. 

We have performed measurements of computation time 
as a function of Nq for a different number of P (see 
Fig. 10I . For large number of Gaussians the optimiza¬ 
tion time has a universal scaling T ~ N^ , with p ss 4, 


meaning that the two-particle integrals (201 are the most 


computationally expensive, as expected. 

Following j55] we introduce the extended-Amdahl law 
to include potential communication overhead. In our 
case, the potentially most time-consuming (among MPI 
communication routines used) MPI_Gather routine 
scales lineraly with P. Taking this into account the 
speed-up can be approximated by the following formula 


SU a 


t (P,Y,X) = 


1 


1-f-S+P 


SP 


(27) 


where 5 is constant to be determined. We performed fit 
of (271 to the speed-up as a function of P for Nq € 
{3, 5, 7} as we show in Fig. [9] As expected / decreases 
with decreasing Nq, but 6 < 10 -4 even for Nq = 3. 
This value is negligible for the reasonable P (the number 
of integrals is the upper bound) for any /. The lack 
of the communication overhead originates not only from 
the utilization of InfiniBand interface and linear scaling 
of the communication routine, but also from the small 
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FIG. 8. Speed-up (SU) as a function of number of processes 
P. Amdahl law curve fited to data. Each point probed 50 
times. 
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/i SU(P,II,I) = (0.0303 + 0.9697/P)' 1 
- ./ SU(°°,II,I) = 32.9833 

4 


25 


20 


15 


■o 

CD 

CD 

« 10 


N g = 3, SU(P,II,I) = (0.5626 - S + 0.4374/P + 6 P)' 

N g = 5, SU(P,II,I) = (0.1713 - 6 + 0.8287/P + 5 P)' 

4 Nq = 7, SU(P,II,I) = (0.0735 - 6 + 0.9265/P + 6 P)' 1 


A 

.+■ 


---f 


„h-+ 4 ' h 

. 4 . 4 . t . 4 . 4 . 4 


h.-*■.+.■«-.■*-.* 


10 


20 


30 

P 


40 


50 


60 


FIG. 9. The relative speed-up SU (231 versus number of nodes 
P for Nq £ {3,5,7} measured 100 times. Note that when 
fitting the so-called extended Amdahl law, communication 
overhead factor 5 ~ 0. Note that for Ng = 3 only 44% of the 
program is performed parallelly. 


amount of data sent by each process to the root (e.g. 
~ 80B for P = 50). Hence the deviation from linearity 
for higher values of P (see Fig. 101 comes from breaking 
the principal assumption: 


For the lower numbers of Gaussians, the time of diagonal- 
ization (performed sequentially) becomes comparable or 
greater to that consumed by the integrals computations. 

The analysis performed above allows to describe the 
boundaries where the proposed solution is effective. How¬ 
ever, from the users perspective the most compelling 
feature is the absolute speed-up SU(P,II,0), as it is 
the metric for the time save. In the next paragraph we 
present this result. 
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FIG. 10. Computation time as a function of Ng for sequential, 
only openMP, and MPI+openMP solutions. Note the linear 
behavior in the regime of large Ng on log-log scale, suggesting 
T ~ Nq. Values found from fit (see key) are consistent with 
ideal case p = 4, where whole time is spent on calculating 
two-body integrals (20). As one-body integrals have p = 2, 
values tend to drift from 4. 


D. Absolute Speed-up 


Whilst analysis of the speed-up in strong scaling 
regime allowed us to investigate the efficiency gain in 
terms of a number of processes engaged in computation 
it is interesting and important to answer what is the abso¬ 
lute acceleration SU(P,II,0). Obviously, this quantity 
still depends on P. In Tab. [I] we show the values of the 
speed-up (231 for different number of nodes P. The ex¬ 


tremal cas e[P = 56 with both levels of parallelization) 
the speed-up 


SU (56, II, 0) = 303.418. (28) 

However, even for the lower number of SMP nodes (P) 
the absolute speed-up is excellent. For the sake of com¬ 
pleteness, we retrieved the acceleration associated with 
openMP utilization (fourth column in Tab. [I]). As each 
SMP node consisted of 16 cores the number of active 
threads was the same. We obtained SU(P,II,I) by 
means of the identity 


SU(P,II,I) 


SU (P, II, 0) 
SU(P,I,0) ■ 


(29) 


TABLE I. Values of the speed-up (SU) for one- and two-level 
parallelism for the different number of nodes P. 


p 

SU(P,I,0) 

SU(P,II,0) 

SU(P,II,I) 

2 

1.866 

25.304 

13.561 

6 

4.969 

67.382 

13.561 

12 

8.288 

112.391 

13.561 

18 

11.479 

155.656 

13.561 

24 

14.174 

192.207 

13.561 

30 

16.870 

228.763 

13.561 

36 

16.779 

227.531 

13.561 

42 

19.026 

258.007 

13.561 

48 

19.085 

258.811 

13.561 

56 

22.375 

303.418 

13.561 


mechanical approach allowing to treat the electronic cor¬ 
relations in a consistent manner. This means that we 
combine the second-quantization aspect (evaluation of 
energies for different many-particle configurations) with 
an explicit evaluation of the renormalized wave func¬ 
tions (first-quantization aspect) in the resultant corre¬ 
lated many-particle state. The number of microscopic 
parameters that are necessary for description of the phys¬ 
ical system can be meaningful in many cases. Hence, 
their computations become challenging as an effect of 
numerical complexity caused by the vast number of in¬ 
tegrations to be performed Here we have addressed 
in detail the part of the whole problem that is associ¬ 
ated with the single-particle basis optimization. We have 
proposed the scheme based on the process-pool concept 
enhanced by the two-level parallelism, and test it uti¬ 
lizing self-made generic implementation m configured 
for the specific computational problem - {H%) n chain. 
The proposed approach is intuitive and has allowed us 
to speed up the calculations significantly (of the order of 
10 2 ) while preserving its generic character. Employing 
process-pool solution to other systems is then straight¬ 
forward. 

Since the considered physical example serves as an il¬ 
lustration of the elaborated scheme capability, one may 
consider engaging it to a wide class of computationally 
advanced physical problems tractable within the frame¬ 
work of EDABI or similar methods. Such problems cover: 

• lattice vibration (phonon) spectrum via the 

so-called direct method (where all but few symme¬ 
tries are lost, increasing the number of integrals 
dramatically - from 510 to over 10 000 for (i? 2 ) n 
chain); 


We found good speed-up ratio ~ 13 (while the upper 
bound is 16) coming from collapsing of nested loops (201. 


• calculation of the electron—lattice coupling pa¬ 
rameters in direct space; 


V. CONCLUSIONS 

We have presented an effective computational ap¬ 
proach related to the EDABI method - quantum- 


• electronic structure calculations of the realistic 
two— and three-dimensional atomic and molec¬ 
ular crystals (e.g. hydrogen, lithium hydride), 
where both the number of atomic orbitals and the 
background field increase essentially. 
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Neither form of the starting Hamiltonian nor the diag- 
onalization scheme choice is essential for the applicabil¬ 
ity of the method, allowing us to incorporate other ap¬ 
proaches such as the Gutzwiller approximation 0H5H2U 
or Gutzwiller Wave Function - Diagramatic Expansion 
[35] to study molecular and extended spatially systems. 
In this manner, one can address e.g. the fundamental 
question of metallization in correlated systems HSU ESQ 
with the explicit evaluation of an model parameters. So 
far we have been able to solve exactly the chain with 
N = 3, 4, 5, and 6 molecules and the results are of simi¬ 
lar type |29]. The finite-size type of scaling on the basis 
of these results requires additional analysis. 
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Appendix A: EDABI method: A brief overview 


from a physical point of view, as the single-particle or¬ 
bital adapts then itself to the presence of other particles 
(electrons). In other words, the correlations induced by 
the predominant interaction terms (~ U and K, : j ) influ¬ 
ence the size and shape of the states {u>i(r)}. This means 
that the orbitals get renormalized in the process of the 
correlated state formation. Nonetheless, it is not a priori 
determined that negligence of the higher virtually excited 
states in the expansion (13 ) is minimized is such a man¬ 
ner. This should be tested and is one of the subjects of 
our current research. This is not the primary topic of 
this paper so we shall not dwell upon it any further here. 

Note also that by selecting the diagonalization of 
many-particle Hamiltonian in the second-quantized form, 
one avoids writing the many-determinant al expansion of 
the multiparticle wave function, as is the case in the 
Cl methods. However, out of our formulation one can 
obtain the function \t(ri,..., r jy) and in particular, de¬ 
fine many-particle covalency 0 . This transition from the 
Fock space back to the Hilbert space is possible as the 
two languages of description are equivalent in the non- 
relativistic situation (for a lucid and didactic exposition 
of the first- and second-quantization schemes and their 
equivalence see e.g. [33]). The principal limitation of 
our method is the circumstance that it can be applied 
directly only to relatively small systems when the exact 
diagonalization is utilized. 


The Exact Diagonalization Ab Initio (EDABI) ap¬ 
proach combines first- and second-quantization aspects 
when solving the many-particle problem as expressed for¬ 
mally by its Hamiltonian. 

The starting Hamiltonian ([8]) contains all possible dy¬ 
namical processes starting from two-body interaction 
V(r — r') in the coordinate (Schrodinger) representation. 
Its version (101 is already truncated and limited to two- 
state (here two-site) terms, i.e., the three- and four-site 
terms have been neglected. The rationale behind this 
omission is that, as shown elsewhere 00, already the 
two-site terms (matrix elements) ~ Jy and Vij are much 
smaller than those ~ U and K^. Parenthetically, all the 
terms are taken into account in the starting atomic basis 
composing the Wannier function. Nonetheless, there is 
no principal obstacle in including all those terms in the 
case of the small systems considered here. 

The second characteristic feature of EDABI is the 
single-particle basis optimization which composes the 
main topic of this paper. If the basis defining the start¬ 
ing Wannier-function basis were complete (i.e., L — > oo 
in Eq. ©), then no basis optimization is required and 
an exact solution is achieved. However, as our basis {u>i} 
is incomplete one, in our view, we are forced to readjust 
the basis so that the system dynamics (correlations) are 
properly accounted for. This introduces a variational as¬ 
pect to our solution, since we introduce a variable wave 
function size, adjustable in the interacting (correlated) 
state. This very feature represents one of the factors 
defining the method. Such adjustment is also reasonable 


Appendix B: Convergence of the single-particle 
energy for infinite system 


In (21a I we have the microscopic parameter contained 
in the single-particle energy expression, namely 



(Bl) 


where i labels lattice site (node) at which we calculate 
this single-particle energy, and j goes over the whole sys¬ 
tem. Also, ipi is the Is Slater-type orbital centered on 
that site. For the sake of clarity, we disregard the or- 
thogonalization procedure, as the one-body parameters 
contain, strictly speaking, a linear combination of inte¬ 
grals in STO basis. 

For an infinite system, the sum ]TF constitutes a series 
of the form m 


c = - (v»i| v 2 \^i) - Y y [ r _ 2 R; | y 


-a 2 -2 a-^ — +2(a + 


IRi 


i’ij (B2) 

—2a|R.,-j | 


which diverges to —oo. In this Appendix we show that 
there is an effective single-particle energy with no diver¬ 
gence for the infinite systems (case more general than 
those discussed in [71 f50]l. 
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We start from Hamiltonian (21c I 


2. Ion-ion repulsion 


Utot — 'H-ext + (B3) 

where ion-ion interaction is defined in the classical limit, 
i.e., 


Vi-i - 


2 


(B4) 


and R,; ;/ is the distance between sites i and j. We ana¬ 
lyze next the remaining contributions, term by term. 


1. Intersite Coulomb term 


Similarly, we can rewrite (B41 to the form 




i 






R, 




Ki) 


o) 




j(i) 


j(i) 


R; 


2 

IR^'I 

(B9) 


Again, the average of the latter term disappears for one 
particle per site. 


The intersite Coulomb term from (21b I cab be rewrit¬ 
ten in the form 


1 

2 


ij 


if<“> 


iff 1 ) 


3. Total Hamiltonian 


\ Y1 KijSbiS^ +1J2J2 (1 — 5ij) Kijhi We rearrange ( |21c[ ) obtaining so that the new form of 
i j Hamiltonian is 


if< 2 ) 

/-^-v 

2 ^ — ) Kijfij 

i j 


iff 3 ) 





(B5) 


'H.tf = H e oS + "^Hubbard + T^Sn- (BIO) 


where <5n,j = (rq — 1) and Sij is Kronecker’s delta. 

We observe that when all sites are taken into account, 
the terms id 1 ) and K ) 2 ) are equivalent. We can rewrite 
them as follows 


The new terms are 


T~L e os 



l 


(BH) 


H k =K& - K ( 3 ) + 2 ^ E (1 - S u) K ^i = ( B6 ) 

* 3 

^(°)_ A -( 3 ) + 2 

* j'(») 

where j(i) denotes the neighborhood of site i. Likewise, 


with ef ff = e* + 1/2E i(i) (2/|Rij| + Kij), 


^Hubbard ( B12 ) 

r/ cr i 


< B? ) 

*¥=3 » j(i) 

We can write finally that 

+ ( B8 ) 

* i(») 

+ 2 E^E^)- 

* j(») 


^<5™ =4 ( B13 ) 



Note that for half-filling (nj = 1 and the last part and 
(id 0 )) disappears. 


The last question is whether the effective single-particle 
energy is now convergent. 
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4. Convergence of the single-particle energy 


We can take from eqs. (B2) and (Bill and rear¬ 
range it in a following manner 


e eff =e- 


1 


E 


I Riil +Kij 


=a 2 — 2a + ^ — ■ 




IRj 


R, 


K, 


+ 2 [a + 


IRi- 


(B14) 

3 — 2a|R -ij | 


— o? — 2a H- ^ ^ 2 
3 


IRi 


o 2a|Rij| 


^-is: 


The latter part disappears in the classical limit R () ^ 
a -1 , where 


K « - w~\' < B15 > 

and the remaining part Yhj 2 + |R,j| ^ e _2Q l RiJ l is 

convergent. 
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